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Abstract Monte Carlo methods are used to approximate the means, /i, of random 
variables Y, whose distributions are not known explicitly. The key idea is that the 
average of a random sample, Y\, . . . ,Y n , tends to ji as n tends to infinity. This article 
explores how one can reliably construct a confidence interval for ji with a prescribed 
half-width (or error tolerance) £. Our proposed two-stage algorithm assumes that the 
kurtosis of Y does not exceed some user-specified bound. An initial independent and 
identically distributed (IID) sample is used to confidently estimate the variance of 
Y. A Berry-Esseen inequality then makes it possible to determine the size of the IID 
sample required to construct the desired confidence interval for jj.. We discuss the 
important case where Y = f(X) and X is a random d- vector with probability density 
function p . In this case jj, can be interpreted as the integral Ld f(x)p(x) dx, and the 
Monte Carlo method becomes a method for multidimensional cubature. 



1 Introduction 

Monte Carlo algorithms provide a flexible way to approximate = E(F) when 
one can generate samples of the random variable Y. For example, Y might be the 
discounted payoff of some financial derivative, which depends on the future perfor- 
mance of assets that are described by a stochastic model. Then ji is the fair option 
price. The goal is to obtain a confidence interval 
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Pr[|/i-AI <£]>!-«, (1) 

where 

• H is approximated by the sample average of n independent and identically dis- 
tributed (IID) samples of Y, 

fL=fL„ = -YYi, (2) 

• e is the half-width of the confidence interval, which also serves as an error toler- 
ance, and 

• a is the level of uncertainty, e.g., 1% or 0.1%, which is fixed in advance. 

Often the sample size, n, is fixed in advance, and the central limit theorem (CLT) 
provides an approximate value for e in terms of n and 

a 2 = Var(y)=E[(y-Ai) 2 ], (3) 

which itself may be approximated by the sample variance. The goal here is some- 
what different. We want to fix e in advance and then determine how large the sample 
size must be to obtain a fixed width confidence interval of the form (1). Moreover, 
we want to make sure that our confidence interval is correct, not just approximately 
correct, or correct in the limit of vanishing e. In this paper we present Algorithm 1 
for obtaining such a fixed width confidence interval for the mean of a real random 
variable when one is performing Monte Carlo sampling. 

Before presenting the method, we outline the reasons that existing fixed width 
confidence intervals are not suitable. In summary, there are two drawbacks of ex- 
isting procedures. Much existing theory is asymptotic, i.e., the proposed procedure 
attains the desired coverage level in the limit as e — >• but does not provide coverage 
guarantees for fixed £ > 0. We want such fixed e guarantees. A second drawback is 
that the theory may make distributional assumptions that are too strong. In Monte 
Carlo applications one typically does not have much information about the underly- 
ing distribution. The form of the distribution for Y is generally not known, Var(F) 
is generally not known, and Y is not necessarily bounded. We are aiming to derive 
fixed width confidence intervals that do not require such assumptions. 

The width (equivalently length) of a confidence interval tends to become smaller 
as the number n of sampled function values increases. In special circumstances, we 
can choose n to get a confidence interval of at most the desired length and at least the 
desired coverage level, 1 — a. For instance, if the variance, a 2 = Var(F), is known 
then an approach based on Chebychev's inequality is available, though the actual 
coverage will usually be much higher than the nominal level, meaning that much 
narrower intervals would have sufficed. Known variance in addition to a Gaussian 
distribution for Y supports a fixed width confidence interval construction that is not 
too conservative. The CLT provides a confidence interval that is asymptotically cor- 
rect, but our aim is for something that is definitely correct for finite sample sizes. 
Finally, conservative fixed width confidence intervals for means can be constructed 
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for bounded random variables, by appealing to exponential inequalities such as Ho- 
effding's or Chernoff's inequality. Unfortunately, Y is often unbounded, e.g., in the 
case where it represents the payoff of a call option. 

If the relevant variance or bound is unknown, then approaches based on sequen- 
tial statistics (Siegmund, 1985) may be available. In sequential methods one keeps 
increasing n until the interval is narrow enough. Sequential confidence intervals re- 
quire us to take account of the stopping rule when computing the confidence level. 
Unfortunately, all existing sequential methods are lacking in some aspects. 

Serfiing and Wackerly (1976) consider sequential confidence intervals for the 
mean (alternatively for the median) in parametric distributions, symmetric about 
their center point. The symmetry condition is not suitable for general purpose Monte 
Carlo applications. 

Chow and Robbins (1965) develop a sequential sampling fixed width confidence 
interval procedure for the mean, but its guarantees are only asymptotic (as e — > 0). 
Mukhopadhyay and Datta (1996) give a procedure similar to Chow and Robbins', 
and it has similar drawbacks. 

Bayesian methods can support a fixed width interval containing ji with 1 — a 
posterior probability, and Bayesian methods famously do not require one to account 
for stopping rules. They do however require strong distributional assumptions. 

There is no assumption-free way to obtain exact confidence intervals for a mean, 
as has been known since Bahadur and Savage (1956). Some kind of assumption is 
needed to rule out settings where the desired quantity is the mean of a heavy tailed 
random variable in which rarely seen large values dominate the mean and spoil the 
estimate of the variance. The assumption we use is an upper bound on the modified 
kurtosis (normalized fourth moment) of the random variable Y: 



(The quantity k— 3 is commonly called the kurtosis.) Under such an assumption we 
present a two-stage algorithm: the first stage generates a conservative upper bound 
on the variance, and the second stage uses this variance bound and a Berry-Esseen 
Theorem, which can be thought of as a non-asymptotic CLT, to determine how 
large n must be for the sample mean to satisfy confidence interval (1). Theorem 
5 demonstrates the validity of the fixed width confidence interval, and Theorem 6 
demonstrates that the cost of this algorithm is reasonable. These are our main new 
theoretical results. 

Our procedure is a two-stage procedure rather than a fully sequential one. In 
this it is similar to the method of Stein (1945, 1949), except that the latter requires 
normally distributed data. 

One might question whether assumption (4), which involves fourth moments of 
Y, is more reasonable than an assumption involving only the second moment of Y. 
For example, using Chebychev's inequality with the assumption 
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also yields a fixed width confidence interval of the form (1). We would argue that (4) 
is indeed more reasonable. First, if Y satisfies (4), then so does cY for any nonzero 
c, however, the analog does not hold for (5). In fact, if a is nonzero, then (5) must 
be violated by cY for c sufficiently large. Second, making KVnax a factor of 10 or 
100 larger than k does not significantly affect the total cost (number of samples 
required) of our two-stage Monte Carlo Algorithm 1 for a large range of values of 
a/e. However, the cost of our Monte Carlo algorithm, and indeed any Monte Carlo 
algorithm based on IID sampling is proportional to a 2 , so overestimating a 2 by a 
factor of 10 or 100 or more to be safe increases the cost of the algorithm by that 
factor. 

An important special case of computing p = E(F) arises in the situation where 
Y = f(X) for some function / : M. d M and some random vector X with probabil- 
ity density function p : W 1 — > [0,°°). One may then interpret the mean of Y as the 
multidimensional integral 

M = M (/)=E(F)= / f(x)p(x)dx. (6) 

Note that unlike the typical probability and statistics setting, where / denotes a 
probability density function, in this paper / denotes an integrand, and p denotes 
the probability density function. Given the problem of evaluating p = J R d g(x) dx, 
one must choose a probability density function p for which one can easily generate 
random vectors X, and then set / = g/p . The quantities a 2 and k defined above can 
be written in terms of weighted J-? ; ,-norms of /: 

\\f\\ p :=\ [ \f(x)\ p p(x)dxY /P , a 2 = \\f-p\\ 2 2 , k=^-^. (7) 
Uu" J II/-MII2 

For a given g, the choice of p is not unique, and making an optimal choice belongs 
to the realm of importance sampling. The assumption of bounded kurtosis, (4), re- 
quired by Algorithm 1, corresponds to an assumption that the integrand / lies in the 
cone of functions 

tf«b» = {/e ^4 : ||/-At(/)|| 4 < ^ll/-M(/)|| 2 }- (8) 

This is in contrast to a ball of functions, which would be the case if one was satis- 
fying a bounded variance condition, (5). 

From the perspective of numerical analysis, if p has independent marginals, one 
may apply a product form of a univariate quadrature rule to evaluate p. However, 
this consumes a geometrically increasing number of samples as d increases, and 
moreover, such methods often require rather strict smoothness assumptions on /. 

If / satisfies moderate smoothness conditions, then (randomized) quasi-Monte 
Carlo methods, or low discrepancy sampling methods for evaluating p are more ef- 
ficient than simple Monte Carlo (Niederreiter, 1992; Sloan and Joe, 1994; Lemieux, 
2009; Dick and Pillichshammer, 2010). Unfortunately, practical error estimation re- 
mains a challenge for quasi-Monte Carlo methods. Heuristic methods have been 
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proposed, but they lack theoretical justification. One such heuristic is used with 
reasonable success in the numerical examples of Section 4. Independent random- 
izations of quasi-Monte Carlo rules of fixed sample size can be used to estimate 
their errors, but they do not yet lead to guaranteed, fixed width confidence intervals. 

Computational mathematicians have also addressed the problem of constructing 
automatic algorithms, i.e., given an error tolerance of e, one computes an approx- 
imation, p., based on « evaluations of the integrand /, such that |jtt — p.\ < e. For 
example, MATLAB (The MathWorks, Inc., 2012), a popular numerical package, 
contains quad, an adaptive Simpson's rule for univariate quadrature routine devel- 
oped by Gander and Gautschi (2000). Although quad and other automatic rules 
generally work well in practice, they do not have any rigorous guarantees that the 
error tolerance is met, and it is relatively simple to construct functions that fool 
them. This is discussed in Section 4. Since a random algorithm, like Monte Carlo, 
gives a random answer, any statements about satisfying an error criterion must be 
probabilistic. This leads us back to the problem of finding a fixed width confidence 
interval, (1). 

An outline of this paper follows. Section 2 defines key terminology and provides 
certain inequalities used to construct our fixed width confidence intervals. The new 
two-stage Algorithm 1 is described in Section 3, where rigorous guarantees of its 
success and its cost are provided. Section 4 illustrates the challenges of computing 
H to a guaranteed precision through several numerical examples. This paper ends 
with a discussion of our results and further work to be done. 



2 Background probability and statistics 

In our Monte Carlo applications, a quantity of interest is written as an expectation: 
jj. = E(F), where F is a real valued random variable. As mentioned above, very 
often Y = f(X) where X £ M. d is a random vector with probability density function 
p. In other settings the random quantity X might have a discrete distribution or be 
infinite dimensional (e.g,. a Gaussian process) or both. For Monte Carlo estimation, 
we can work with the distribution of Y alone. The Monte Carlo estimate of jj. is the 
sample mean, as given in (2), where the F, are IID random variables with the same 
distribution as F. 



2.1 Moments 

Our methods require conditions on the first four moments of F as described here. 
The variance of F, as defined in (3), is denoted by a 2 , and its non-negative square 
root, a, is the standard deviation of F. Some of our expressions assume without 
stating it that a > 0, and all will require a < °°. The skewness of F is y = E[(F — 
Ai) 3 ]/a 3 , and the kurtosis ofFisJC=K:-3 = E[(F - y,) A ]/o A - 3 (see (4)). The 
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mysterious 3 in k is there to make it zero for Gaussian random variables. Also, 
jl , o 2 , y, K are related to the first four cumulants (McCullagh, 1987, Chap. 2) of the 
distribution of Y, meaning that 

, o 2 t 2 ycr 3 / 3 xo A t A , ds 

log(E[exp(/y)]) = tit + — + + __ + (,4). 

Our main results require a known upper bound for K, which then implies that a and 
y are finite. 



2.2 CLT intervals 

A random variable Z has the standard normal distribution, denoted by ^V(0, 1), if 
Pr(Z <z) = -f= f exp(-f 2 /2)df =: <p( z ). 

Under the central limit theorem, the distribution of y/n(p.„ — ju)/<7 approaches 
c/K(0, 1) as « — ^ oo, where /t„ denotes the sample mean of n IID samples. As a 
result 

Pr(£„- 2.58a^ < At <A„ + 2.58a/V^) ^0.99 (9) 

as n — ^ oo. We write the interval in (9) as /i„ ± 2.58a/s/n. Equation (9) cannot be 
used when a 2 is unknown, but the usual estimate 

^ = ^ T f(y-A») 2 do) 

may be substituted, yielding the interval jl u ±2.5%s n / ^Jn which also satisfies the 
limit in (9) by Slutsky's theorem (Lehmann and Romano, 2005). For an arbitrary 
confidence level 1-ae (0, 1), we replace the constant 2.58 by z a /2 = — 
a/2). The width of this interval is 2z a /2Sn/\/n, and when fi is in the interval then 
the absolute error [ jU. — jQ n | < e := z a ji s nl \fn. 

The coverage level of the CLT interval is only asymptotic. In more detail, (Hall, 
1988, p. 948) shows that 

Pr(|M-A»l<2.58i/V«) =0.99 + i(A+By 2 + CK-) + o(^) (11) 

for constants A, B, and C that depend on the desired coverage level (here 99%). 
Hall's theorem requires only that the random variable Y has sufficiently many finite 
moments and is not supported solely on a lattice (such as the integers). It is interest- 
ing to note that the 0(1 /n) coverage error in (11) is better than the 0(1 / \fn) root 
mean squared error for the estimate jX n itself. 
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2.3 Standard Probability Inequalities 



Here we present some well known inequalities that we will use. First, Chebychev's 
inequality ensures that a random variable (such as fx n ) is seldom too far from its 
mean. 

Theorem 1 (Chebychev's Inequality). (Lin and Bai, 2010, 6.1c, p. 52) Let Zbe a 

random variable with mean jj, and variance a 2 > 0. Then for all £ > 0, 

Pr[|Z-Ml>£]<fJ. 

In some settings we need a one sided inequality like Chebychev's. We will use 
this one due to Cantelli. 

Theorem 2 (Cantelli's Inequality). (Lin and Bai, 2010, 6.1e, p. 53) Let Z be any 

random variable with mean fl and finite variance a 2 . For any a > 0, it follows that: 

a 2 



Pr[Z-ju > a] < 

Berry-Esseen type theorems govern the rate at which a CLT takes hold. We will 
use the following theorem which combines recent work on both uniform and non- 
uniform (jt-dependent right hand side) versions. 

Theorem 3 (Berry-Esseen Inequality). Let Y\ , . . . , Y„ be IID random variables 
with mean fl, variance a 2 > 0, and third centered moment M3 = E \ Yj — fl | 3 /<7 3 < 
00. Let (x n = (Y\ H \-Y,,)/n denote the sample mean. Then 



Pr 



< x 



*(*) 



< A n {x,M 3 ) := ^min ( A t (M 3 +A 2 ), ^% I V* £ R, 

v« V i + W y 

where Ax =0.3328 and A 2 = 0.429 (Shevtsova, 2011), andA 3 = 18.1 139 (Nefedova 
and Shevtsova, 2012). 

The constants in the Berry-Esseen Inequality above have been an area of active 
research. We would not be surprised if there are further improvements in the near 
future. 

Our method requires probabilistic bounds on the sample variance, s 2 . For that, 
we will use some moments of the variance estimate. 

Theorem 4. (Miller, 1986, Eq. (7.16), p. 265) Let Y u ...,Y n be IID random vari- 
ables with variance O 2 and modified kurtosis k defined in (4). Let s 2 be the sample 
variance as defined in (10). Then the sample variance is unbiased, E(i^) = <7 2 , and 
its variance is 
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3 Two-stage confidence interval 

Our two-stage procedure works as follows. In the first stage, we take a sample of 
independent values Y\ , . . . , Y„ a from the distribution of Y. From this sample we com- 
pute the sample variance, \Jj , according to (10) and estimate the variance of Y, by 
<7 2 = £ 2 s 2 a , where £ 2 > 1 is a "variance inflation factor" that will reduce the prob- 
ability that we have underestimated a 2 = Var(F). For the second stage, we use the 
estimate a 2 as if it were the true variance of Y, and use Berry-Esseen theorem to 
obtain a suitable sample size, n^, for computing the sample average, fi, that satisfies 
the fixed with confidence interval (1). 

The next two subsections give details of these two steps that will let us bound 
their error probabilities. Then we give a theorem on the method as a whole. 



3.1 Conservative variance estimates 

We need to ensure that our first stage estimate of the variance a 2 is not too small. 
The following result bounds the probability of such an underestimate. 

Lemma 1. Let Y\,. . . , F„ be IID random variables with variance a 2 > and kurtosis 
K. Let s 2 be the sample variance defined at (10), and let k = K + 3. Then 



Pr 



Pr 



si < CT <l 



< > ° \ 1 - 



K — 



n — 3 \ / 1 - a 



n-\ 



K — 



n-l 



an 



n — 3\ ( 1 - a 



an 



> 1 -a, 



> 1 -a. 



(12a) 
(12b) 



Proof. Applying Theorem 4 and choosing 



Var(^) 



1 -a 
a 



_n_3^ fl^ )>0) 



n-l 



an 



it follows from Cantelli's inequality (Theorem 2) that 
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si- (J 2 > G 2 * 



K — 



< 



n-3 
n-1 

Var(^) 



1-a 



an 



= Pr \sl -a 2 >a\ 



Var(j*) 



a 2 + Var(^) Var(ja ) 1=2 + Var(^) 



1-cn 
a J 



a. 



Then (12a) follows directly. By a similar argument, applying Cantelli's inequality 
to the expression Pr [— s 2 + a 2 > a] implies (12b). □ 

Using Lemma 1 we can bound the probability that a 2 = £ 2 s 2 a overestimates a 2 . 
Equation (12a) implies that 



Pr 



> a 



l-a 
n a -l I \ an a 



> 1 -a. 



Thus, it makes sense for us to require the modified kurtosis, k, to be small enough, 
relative to n a , a, and £, in order to ensure that Pr((7 2 > a 2 ) >1 — a. Specifically, 
we require 



1 



1 - 



(ft— "g~ 3 ^ ( !-» 



or equivalently, 



k< 



n a -3 



n a - 1 

This condition is the explicit version of (4) mentioned in the introduction. 



(13) 



3.2 Conservative interval widths 

Here we consider how to choose the sample size tin to get the desired coverage level 
from an interval with half-length at most e. We suppose here that a is known. In 
practice we will use a conservative (biased high) estimate for a. 

First, if the CLT held exactly and not just asymptotically, then we could use a 
CLT sample size of 

AfcLT(e, O, a) = {^^-) 

independent values of Yj in an interval like the one in (9). 

Given knowledge of a, but no assurance of a Gaussian distribution for jl n , we 
could instead select a sample size based on Chebychev's inequality (Theorem 1). 
Taking 
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Afcheb(£,CT,a) = 



(14) 



IID observations of Y gives the confidence interval (1). Naturally Afcheb > Afcu- 

Finally, we could use the non-uniform Berry-Esseen inequality from Theorem 3. 
This inequality requires a finite scaled third moment M3 = E \Y, ■ — jU| 3 /(J 3 . If £l n 
denotes a sample mean of n IID random instances of Y, then the non-uniform Berry- 
Esseen inequality implies that 



Pr[|ji -£„|<£] =Pr 



M < Vne 



Pr 



> [^(^e/(7)-4„(V^e/(T,M3)] 

- [<P{-y/Ee/o)+A n (-y/?ie/o,M 3 )] 
= l-2[<P(-y/ne/o)+A n (y/ne/a,M 3 )}, 



(15) 



since A n (— x,M$) = A n (x,Mj). The probability of making an error no greater than 
e is bounded below by 1 — a, i.e., the fixed width confidence interval (1) holds with 
p. = fl n , provided n > A^be(£, ct, ot,M-s), where the Berry-Esseen sample size is 

Nbe{£,o,cc,M 3 ) :=min|n 6 N : 3> (— s/ne/a) +A n (^/ne/a,M 3 ) < — J. (16) 

To compute A^be(£, C, a,M3), we need to know M3. In practice, substituting an up- 
per bound on M3 yields an upper bound on the necessary sample size. 

Note that if the A„ term in (16) were absent, A^be would correspond to the CLT 
sample size Aclt, and in general A^be > Acer- It is possible that in some situations 
A^be > Afcheb might hold, and in such cases we could use Afcheb instead of A^be- 



3.3 Algorithm and Proof of Its Success 

In detail, the two-stage algorithm works as described below. 

Algorithm 1 (Two Stage). The user specifies four quantities: 

• an initial sample size for variance estimation, n a g {2,3, . . .}, 

• a variance inflation factor £ 2 6 ( 1 , °°), 

• an uncertainty a £ (0, 1), and, 

• an error tolerance or confidence interval half-width, e > 0. 

At the first stage of the algorithm, Y\ , . . . , Y„ a are sampled independently from 
the same distribution as Y. Then the conservative variance estimate, a 2 = € 2 sl , is 
computed in terms of the sample variance, s% , defined by (10). 

To prepare for the second stage of the algorithm we compute a = 1 — \/I — a 
and then K" max = K max (a,n a , £) using equation (13). The sample size for the second 
stage is 
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Hf, =N M (£,d,a,K- m / ax), (17) 

where 

N^(e,(7,a,M) :=max(l,min(At heb (e,(7,a),A^BE(e,ff,a,M))). (18) 

Recall that Afcheb is defined in (14) and A^be is defined in (16). 

After this preparation, the second stage is to sample F„ a+ i, . . . ,F„ CT +„„ indepen- 
dently from the distribution of Y, and independently of Y\ , . . . , Y Ha . The algorithm 
then returns the sample mean, 

, na+nfi 

£ = — E Y i- < 19 > 

% i'=n CT +l 



The success of this algorithm is guaranteed in the following theorem. The main 
assumption needed is an upper bound on the kurtosis. 

Theorem 5. Let Y be a random variable with mean \l, and either zero variance 
or positive variance with modified kurtosis k < k maK (a,n a ,(t). It follows that Al- 
gorithm 1 above yields an estimate jl given by (19) which satisfies the fixed width 
confidence interval condition 

Pr(|j&-ju| <e) > I -a. 

Proof. If a 2 = 0, then = 0, = 1 and fx = jU with probability one. Now con- 
sider the case of positive variance. The first stage yields a variance estimate satis- 
fying Pr(<7 2 > a 1 ) > 1 — a by the argument preceding the kurtosis bound in (13) 
applied with uncertainty a. The second stage yields Pr(|/i — ji \ <e) > 1 — a by the 
Berry-Esseen result (15), so long as ff > ff and M3 < K - max (a,« (J , £) 3 / 4 . The sec- 
ond condition holds because M3 < K -3 / 4 by Jensen's Inequality (Lin and Bai, 2010, 
8.4.b). Thus, in the two-stage algorithm we have 

Pr(|A-ju| <e) = E[Pr(|A-Ml < e | a")] 
>E[(l-ft)l ff < d ] 

> (1 -a)(l -a) = 1 -a. □ 

Remark 1. As pointed out earlier, the guarantees in this theorem require that the 
modified kurtosis of Y not exceed the specified upper bound K" max - As it is presented, 
Algorithm 1 takes as inputs, n a , £, and a, and uses these to compute K" max according 
to (13). The reason for doing so is that one might have a better intuition for n a , <£, 
and a. Alternatively, one may specify n a and k max and use (13) to compute £, or 
specify £ and k mia and use (13) to compute n a . The issue of how one should choose 
na, £, and k max in practice is discussed further in Section 5. 

Remark 2. In this algorithm it is possible to choose much smaller than n a if the 
sample variance is small. As a practical matter we suggest that if one is willing to 
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invest n a samples to estimate the variance then one should be willing to invest at 
least that many additional samples to estimate the mean. Therefore, in the numerical 
examples of Section 4 we use 

N^{e,a,a,M) :=max(« a ,min(A^ C heb(e,^,a),AfBE(e,cr,a,M))) (20) 

instead of (18) to determine the sample size for the sample mean. Because the vari- 
ance is typically harder to estimate accurately than the mean, one may wonder 
whether n a should be chosen greater than n^. However, for Monte Carlo simula- 
tion we only need the variance to one or two digits accuracy, whereas we typically 
want to know the mean to a much higher accuracy. By the error bound following 
from Chebychev's inequality (Theorem 1), the definition of Nn in (20) means that 
the fixed width confidence interval constructed by Algorithm 1 also holds for any 
random variables, Y, with small variance, namely, a 2 < E 2 an a , even if its kurtosis 
is arbitrarily large. 

As mentioned in the introduction, one frequently encountered case occurs when 
Y is a (i-variate function of a random vector X. Then jj, corresponds to the multi- 
variate integral in (6) and Theorem 5 may be interpreted as below: 

Corollary 1. Suppose that p : W 1 — > R is a probability density function, the inte- 
grand f : W 1 — > R has finite J2?4 norm as defined in (7), and furthermore f lies in the 
cone ^it max defined in (8), where K mlL x = jf max (a,n CT , £). It follows that Algorithm 
1 yields an estimate, (i, of the multidimensional integral /I defined in (6), which 
satisfies the fixed width confidence interval condition 

Pr(|/i-ju| <£) > I -a. 



3.4 Cost of the Algorithm 

The number of function values required by the two-stage Algorithm 1 is n c + 11^, 
the sum of the initial sample size used to estimate the variance of Y and the sample 
size used to estimate the mean of Y . Although n a is deterministic, n^ is a random 
variable, and so the cost of this algorithm might be best defined probabilistically. 
Moreover, the only random quantity in the formula for rin in (17) is d 2 , the upper 
bound on variance. Clearly this depends on the unknown population variance, a 2 , 
and we expect a 2 not to overestimate a 2 by much. Thus, the algorithm cost is 
defined below in terms of a 2 and the error tolerance (interval half-width) e. An 
upper bound on the cost is then derived in Theorem 6. 

Let A be any random algorithm that takes as its input, a method for generat- 
ing random samples, Yi,Yi,... with common distribution function F having vari- 
ance a 2 and modified kurtosis k. Additional algorithm inputs are an error tolerance, 
£, an uncertainty, a, and a maximum modified kurtosis, K" max - The algorithm then 
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computes fl = A(F 7 e,a 7 k max ), an approximation to jj. = E(Y), based on a total of 
Mot(e, OC, k max ,F) samples. The probabilistic cost of the algorithm, with uncertainty 
/3, for integrands of variance no greater than o"^ as and modified kurtosis no greater 
than k max is defined as 

Mot(e,a,j3,K- m ax,c7 m ax) := sup min{Ar : Pr[M ot (e, a, Krmx,F) <N}> 1-/3}. 

K"^ Kmax 
tf<°max 

Note that k mix is an input to the algorithm, but c max is not. The cost of an arbitrary 
algorithm, A may also depend on other parameters, such as n a and £ in our Algo- 
rithm 1, which are related to k mAX . However, this dependence is not shown explicitly 
to keep the notation simple. 

The cost of the particular two-stage Monte Carlo algorithm defined in Algorithm 
1 is 

_ sup min {/V : Pr^ + JV M (e, a, a, k^ x ) < N) > 1 - J3 } . 

C<Cmax 

Since n a is fixed, bounding this cost depends on bounding Np (e, <7, 6t, Kmax)> which 
depends on a as given by Algorithm 1 . Moreover, a can be bounded above using 
(12a) in Lemma 1. For k < K" max , 



1-/3 <Pr 




= Pr[d 2 <(jV(a,j3,e:)], 

where 



Noting that N ll (e,-,a, Kkax) is a non-decreasing function allows one to derive the 
following upper bound on the cost of the adaptive Monte Carlo algorithm. 

Theorem 6. The two-stage Monte Carlo algorithm for fixed width confidence in- 
tervals based on IID sampling described in Algorithm 1 has a probabilistic cost 
bounded above by 

N tot (e,a,j5 

< N ap {e,a,j5 , K- max , cw) := n a +N^(e, <7 max v(a,P , £), a, k^ x ). 

Note that the Chebychev sample size, Ncheb, defined in (14), the Berry-Esseen 
sample size, A^be, defined in (16), and thus all depend on a and e through 
their ratio, a/e. Thus, ignoring the initial sample used to estimate the variance, 
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W to ,(e,a,/3,K'max,cT m ax) is roughly proportional to a^Je 2 , even though a max is 
not a parameter of the algorithm. Algorithm 1 adoptively determines the sample 
size, and thus the cost, to fit the unknown variance of Y. Random variables, Y, with 
small variances will require a lower cost to estimate fi with a given error tolerance 
than random variables with large variances. 




(a) (b) 



Fig. 1 (a) The cost ratios of « up (e, 0.01, 0.01, if max , a) /N C lt(£, Cf, 0.01) for if max = 2, 10, and 100, 
with f!(j = 4000/f ra ax (dashed) and n a optimized (solid); (b) the optimal values of % (solid) and £ 
(dashed). 

Figure la shows the ratio of the upper bound of the cost, A up (£, 0.01 ,0.01 , K" ma x, cr), 
to the ideal CLT cost, A C lt(£, cr,0.01) = [(2.58c7/e) 2 ], for a range of a/e ratios 
and for KV nax = 2,10, and 100. In these graphs the formula defining A up in Theorem 
6 uses the alternative and somewhat costlier formula for Nu in (20). The dashed 
curves in Figure la show these cost ratios with n a = 4000 KVnax > which corresponds 
to £ rj 1.1. The solid curves denote the case where n a and <£ vary with a/e to min- 
imize iV U p. Figure lb displays the optimal values of n a (solid) and £ (dashed). In 
both figures, higher curves correspond to higher values of KV nax . 

Here, Aclt denotes the ideal cost if one knew the variance of Y a priori and knew 
that the distribution of the sample mean was close to Gaussian. The cost ratio is the 
penalty for having a guaranteed fixed width confidence interval in the absence of 
this knowledge about the distribution of Y. For smaller values of New, equivalently 
smaller a/e, this cost ratio can be rather large. However the absolute effect of this 
large penalty is mitigated by the fact that the total number of samples needed is not 
much. For larger A^clt, equivalently larger a/e, the cost ratio approaches somewhat 
less than 1.4 in the case of optimal n a and €, and somewhat less than 2 for n a = 
lOOOiw. 

The discontinuous derivatives in the curves in Figure 1 arise from the minimum 
and maximum values arising in formulas (16) and (20) for A^be and , respectively. 
Taking the upper dashed curve in Figure la as an example, for Nqlt l ess than about 
3.5 x 10 4 , Nfi = n a . For Aclt from about 3.5 x 10 4 to about 6 x 10 6 , corresponds 
to the second term in the minimum in the Berry-Esseen inequality, (16), i.e., the non- 
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uniform term. For Nqlt greater than 6 x 10 6 , corresponds to the first term in the 
minimum in the Berry-Esseen inequality, (16), i.e., the uniform term. 

The ideal case of optimizing n a and £ with respect to a/e is impractical, since 
(7 is not known in advance. Our suggestion is to choose € around 1.1, and then 
choose n (j as large as needed to ensure that KVnax is as large as desired. For example 
with € = 1.1 and ic max = 2,10, and 100 we get n a = 6593, 59311, and 652417 
respectively. 



4 Numerical Examples 

4.1 Univariate Fooling Functions for Deterministic Algorithms 

Several commonly used software packages have automatic algorithms for integrat- 
ing functions of a single variable. These include 

• quad in MATLAB (The Math Works, Inc., 2012), adaptive Simpson's rule based 
on adapt sim by Gander and Gautschi (2000), 

• quadgk in MATLAB (The MathWorks, Inc., 2012), adaptive Gauss-Kronrod 
quadrature based on quadva by Shampine (2008), and 

• the chebfun (Hale et al, 2012) toolbox for MATLAB (The MathWorks, Inc., 
2012), which approximates integrals by integrating interpolatory Chebychev 
polynomial approximations to the integrands. 

For these three automatic algorithms one can easily probe where they sample 
the integrand, feed the algorithms zero values, and then construct fooling functions 
for which the automatic algorithms will return a zero value for the integral. Figure 
2 displays these fooling functions for the problem ji = J f(x) dx for these three 
algorithms. Each of these algorithms is asked to provide an answer with an absolute 
error no greater than 10~ 14 , but in fact the absolute error is 1 for these fooling 
functions. The algorithms quad and chebfun sample only about a dozen points 
before concluding that the function is zero, whereas the algorithm quadgk samples 
a much larger number of points (only those between and 0.01 are shown in the 
plot). 



4.2 Integrating a Single Hump 

Accuracy and timing results have been recorded for the integration problem jj. = 
J[0 i] d f( x ) f° r a sm gl e hump test integrand 
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quad quadgk chebfun 



Fig. 2 Plots of fooling functions, /, with fl = J f(x) dx = 1, but for which the corresponding 
algorithms return values of fl = 0. 

Here x is a d dimensional vector, and ao,bo, . . . ,bd,c\,. . . ,Cd,h\,. . . ,h<j are param- 
eters. Figure 3 shows the results of different algorithms being used to integrate 500 
different instances of /. For each instance of /, the parameters are chosen as fol- 
lows: 

• b\,. . . ,bd £ [0.1, 10] with log(bj) being i.i.d. uniform, 

• ci,... ,Cd £ [10~ 6 , 1] with log(c 7 ) being i.i.d. uniform, 

• hi , . . . , h<j £ [0, 1] with hj being i.i.d. uniform, 

• bo chosen in terms of the b\, . . . ,b^,ci, . . . ,Cd,hi, . . . ,hd to make a 2 = \\f — fl\\2 £ 
[10~ 2 , 10 2 ], with log(a) being i.i.d. uniform for each instance, and 

• «o chosen in terms of the bo,--- ,bd,ci, . . . ,Cd,h\, . . . to make fl = \ . 

These ranges of parameters are chosen so that the algorithms being tested fail to 
meet the error tolerance a significant number of times. 

These 500 random constructions of / with d = 1 are integrated using quad, 
quadgk, chebfun, Algorithm 1, and an automatic quasi-Monte Carlo algorithm 
that uses scrambled Sobol' sampling (Owen, 1995, 1997a,b; Matousek, 1998; Hong 
and Hickernell, 2003; Dick and Pillichshammer, 2010). For the Sobol' sampling al- 
gorithm the error is estimated by an inflation factor of 1 . 1 times the sample standard 
deviation of 8 internal replicates of one scrambled Sobol' sequence (Owen, 2006). 
The sample size is increased until this error estimate decreases to no more than the 
tolerance. We have not yet found simple conditions on integrands for which this 
procedure is guaranteed to produce an estimate satisfying the error tolerance, and 
so we do not discuss it in detail. We are however, intrigued by the fact that it does 
seem to perform rather well in practice. 

For all but chebfun, the specified absolute error tolerance is e = 0.001. The 
algorithm chebfun attempts to do all calculations to near machine precision. The 
observed error and execution times are plotted in Figure 3. Whereas chebfun uses 
a minimum of 2 3 + 1 = 9 function values, the figure labeled "chebfun (heavy 
duty)" displays the results of requiring chebfun to use at least 2 8 + 1 = 257 func- 
tion values. Algorithm 1 takes a = 0.01, and £ = 1.1. For the plot on the left, 
/ifj = 2 13 = 8192, which corresponds to K" max = 2.24. For the heavy duty plot on 
the right, n a = 2 18 = 262144, which corresponds to k max = 40.1. The same initial 
sample sizes are used for the Sobol' sampling algorithm. 
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Algorithm 1 



10 10 10 10 

Error 

Algorithm 1 (heavy duty) 




Sobol' 



Sobol' (heavy duty) 



Fig. 3 Execution times and errors for test function (21) for d = 1 and error tolerance e = 10~ 3 , 
and a variety of parameters giving a range of a and if. Those points to the left/right of the dashed 
vertical line represent successes/failures of the automatic algorithms. The solid line shows that 
cumulative distribution of actual errors, and the dot-dashed line shows the cumulative distribution 
of execution times. For the Algorithm 1 the points labeled * are those for which the Corollary 1 
guarantees the error tolerance. 
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Figure 3 shows that quad and quadgk are quite fast, nearly always providing 
an answer in less than 0.01 seconds. Unfortunately, they successfully meet the er- 
ror tolerance only about 30% of the time for quad and 50-60% of the time for 
quadgk. The difficult cases are those where c\ is quite small, and these algorithms 
miss the sharp peak. The performance of chebf un is similar to that of quad and 
quadgk. The heavy duty version of chebf un fares somewhat better. For both of 
the chebf un plots there are a significant proportion of the data that do not appear 
because their errors are smaller than 10~ 5 . 

In the plots for Algorithm 1 the alternative and somewhat costlier formula for Np 
in (20) is employed. An asterisk is used to label those points satisfying k < k max , 
where k is defined in (7). All such points fall within the prescribed error toler- 
ance, which is even better than the guaranteed confidence of 99%. For Algorithm 1 
(heavy duty) K" max is larger, so there are more points for which the guarantee holds. 
Those points labeled with a dot, are those for which k > k max , and so no guar- 
antee holds. The points labeled with a diamond are those for which Algorithm 1 
attempts to exceed the cost budget that we set, i.e., it wants to choose tin such that 
«<r + «/j > Mnax := 10 9 . In these cases is chosen as |10 9 — n a \ , which often is 
still large enough to get an answer that satisfies the error tolerance. Algorithm 1 
performs somewhat more robustly than quad, quadgk, and chebf un, because it 
requires only a low degree of smoothness and takes a fairly large minimum sample. 
Algorithm 1 is generally much slower than the other algorithms because it does not 
assume any smoothness of the integrand. The more important point is that Algo- 
rithm 1 has a guarantee, whereas to our knowledge, the other routines do not. 

From Figure 3, the SoboF sampling algorithm is more reliable and takes less 
time than Algorithm 1. This is due primarily to the fact that in dimension one, 
SoboF sampling is equivalent to stratified sampling, where the points are more 
evenly spread than IID sampling. 

Figure 4 repeats the simulation shown in Figure 3 for the same test function (21), 
but now with d = 2, ... ,8 chosen randomly and uniformly. For this case the univari- 
ate integration algorithms are inapplicable, but the multidimensional routines can 
be used. There are more cases where the Algorithm 1 tries to exceed the maximum 
sample size allowed, i.e., (n a +n^)d > A^ max := 10 9 , but the behavior seen for d = 1 
still generally applies. 



4.3 Asian Geometric Mean Call Option Pricing 

The next example involves pricing an Asian geometric mean call option. Suppose 
that the price of a stock S at time t follows a geometric Brownian motion with 
constant interest rate, r, and constant volatility, v. One may express the stock price 
in terms of the initial condition, 5(0), as 



S(t) =S{0)exp[(r-v z /2)t + vB(t)], t > 0, 
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where B is a standard Brownian motion. The discounted payoff of the Asian geo- 
metric mean call option with an expiry of T years, a strike price of K, and assuming 
a discretization at d times is 

Y = m&x(^^/^S{T/d)S{2T/d)---S{T{d-\)/d)^/^] l l d -K^- rT . 

(22) 

The fair price of this option is ji = E(T). One of our chief reasons for choosing 
this option for numerical experiments is that its price can be computed analytically, 
while the numerical computation is non-trivial. 

In our numerical experiments, the values of the Brownian motion at different 
times required for evaluating the stock price, B(T /d),B(2T /d), . . . ,B(T), are com- 
puted via a Brownian bridge construction. This means that for one instance of 
the Brownian motion we first compute B(T), then B(T/2), etc., using indepen- 
dent Gaussian random variables X\ , . . . ,Xd, suitably scaled. The Brownian bridge 
accounts for more of the low frequency motion of the stock price by the Xj with 
smaller j, which allows the SoboF sampling algorithm to do a better job. 

The option price, ji = E(F), is approximated by Algorithm 1 and the Sobol' 
sampling algorithm using an error tolerance of e = 0.05, and compared to the an- 
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alytic value of ji. The result of 500 replications is given in Figure 5. Some of the 
parameters are set to be fixed values, namely, 

5(0) =£ = 100, T = l, r = 0.03. 

The volatility, v, is drawn uniformly between 0. 1 and 0.7. The number of time steps, 
d, is chosen to be uniform over {1,2,4,8,16,32}. The true value of ji for these 
parameters is between about 2.8 and 14. 




1(T 5 10"" 1Cf 3 1(T Z 10"' 10° 1(T 5 10"" 1Cf 3 1(T Z 10"' 10° 

Error Error 



Algorithm 1 Algorithm 1 (heavy duty) 




10~ 5 10"" 10~ 3 10~ 2 10"' 10° 10~ 5 10"" 10~ 3 10~ Z 10"' 10° 

Error Error 

Sobol' Sobol' (heavy duty) 



Fig. 5 Execution times and errors for the Asian geometric mean call option for d = 1 , 2, 4, 8, 16, 32 
and e = 0.05. 



For this example the true kurtosis of Y is unknown. Both Algorithm 1 and the 
Sobol' sampling algorithm compute the option price to the desired error tolerance 
with high reliability. For the IID sampling Algorithm 1 and the ordinary Sobol' 
sampling algorithm it can be seen that some of the errors are barely under the error 
tolerance, meaning that the sample size is not chosen too conservatively. For the 
heavy duty Sobol' algorithm, the high initial sample size seems to lead to smaller 
than expected errors and larger than necessary computation times. 
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5 Discussion 

Practitioners often construct CLT-based confidence intervals with the true variance 
estimated by the sample variance, perhaps multiplied by some inflation factor. Of- 
ten, this approach works, but it has no guarantee of success. The two-stage algorithm 
presented here is similar to the approach just described, but it carries guarantees. 
These are derived by employing Cantelli's inequality to ensure a reliable variance 
upper bound, and by employing a Berry-Esseen inequality to ensure a large enough 
sample for the sample mean. 

In certain cases our procedure multiplies the computational cost by a large factor 
such as 2 or 10 or even 100 compared to what one might spend based on the CLT 
with a known value of a (see Figure 1). While this seems inefficient, one should 
remember that the total elapsed time may still be well below several seconds. Fur- 
thermore, one typically does not know a in advance, and our adaptive algorithm 
estimates a and then an appropriate sample size from the data. Our algorithmic 
cost will be low when the unknown a is small and large when a is large. 

Like any algorithm with guarantees, our algorithm does need to make assump- 
tions about the random variable Y. We assume a known bound on the kurtosis of Y, 
either specified directly or implied by the user's choice of the sample size for esti- 
mating the variance, n a , and the variance inflation factor, £ 2 . This is a philosophical 
choice. We prefer not to construct an algorithm that assumes a bound on the vari- 
ance of Y , because such an algorithm would not be guaranteed for cY with \c\ large 
enough. If our algorithm works for Y, it will also work for cY, no matter how large 
\c\ is. 

In practice the user may not know a priori if k < K" max since it is even more 
difficult to estimate k from a sample than it is to estimate a 2 . Thus, the choice of 
Kmax relies on the user's best judgement. Here are a few thoughts that might help. 
One might try a sample of typical problems for which one knows the answers and 
use these problems to suggest an appropriate >c max . Alternatively, one may think of 
Kmax not as a parameter to be prescribed, but as a reflection of the robustness of 
one's Monte Carlo algorithm having chosen a, n a and £. The discussion at the end 
of Section 3.4 provides guidance on how to choose n a and € to achieve a given 
^max in a manner that minimizes total computational cost. Briefly, one should not 
skimp on n a , but choose n a to be several thousand times K" max and employ a £ that 
is relatively close to unity. Another way to look at the Theorem 5 is that, like a 
pathologist, it tells you what went wrong if the two-stage adaptive algorithm fails: 
the kurtosis of the random variable must have been too large. In any case, as one 
can see in Figure 1, in the limit of vanishing e/c, i.e., Ncu — > °°. the choice of 
Kmax makes a negligible contribution to the total cost of the algorithm. The main 
determinant of computational cost is e/a. 

Bahadur and Savage (1956) prove in Corollary 2 that it is impossible to construct 
exact confidence intervals for the mean of random variable whose distribution lies 
in a set satisfying a few assumptions. One of these assumptions is that the set of 
distributions is convex. This assumption is violated by our assumption of bounded 
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kurtosis in Theorem 5. Thus, we are able to construct guaranteed confidence inter- 
vals. 

Our algorithm is adaptive because « M is determined from the sample variance. 
Information-based complexity theory tells us that adaptive information does not help 
for the integration problem for symmetric, convex sets of integrands, /, in the worst 
case and probabilistic settings (Traub et al, 1988, Chapter 4, Theorem 5.2. 1 ; Chapter 
8, Corollary 5.3.1). Here, in Corollary 1 the cone, ^jf max , although symmetric, is not 
a convex set, so it is possible for adaption to help. 

There are a couple of areas that suggest themselves for further investigation. One 
is relative error, i.e., a fixed width confidence interval of the form 

Pr[]/x — Al <e|M|] > 1 -«■ 

Here the challenge is that the right hand side of the first inequality includes the 
unknown mean. 

Another area for further work is to provide guarantees for automatic quasi-Monte 
Carlo algorithms. Here the challenge is finding reliable formulas for error estima- 
tion. Typical error bounds involve a semi-norm of the integrand that is harder to 
compute than the original integral. For randomized quasi-Monte Carlo an estimate 
of the variance of the sample mean using n samples does not tell you much about 
the variance of the sample mean using a different number of samples. 
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